[GSoC] Update tune API for LLM hyperparameters optimization #2393

Merged

Conversation

helenxie-bit
Contributor

What this PR does / why we need it:
This PR implements the initial functionality for LLM hyperparameter optimization in the tune API.
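
As a rough illustration of what hyperparameter optimization over a fine-tuning job involves, here is a minimal, standalone random-search sketch. It does not use or reproduce the Katib `tune` API; the search-space names (`learning_rate`, `lora_r`), their ranges, and the `toy_loss` objective are all hypothetical stand-ins chosen for this example.

```python
import random

# Hypothetical search space: names and ranges are illustrative only,
# not taken from this PR or from the Katib SDK.
SEARCH_SPACE = {
    "learning_rate": ("double", 1e-5, 5e-5),
    "lora_r": ("int", 8, 32),
}


def sample_trial(space, rng):
    """Draw one hyperparameter assignment uniformly from the space."""
    params = {}
    for name, (kind, low, high) in space.items():
        if kind == "int":
            params[name] = rng.randint(low, high)  # both bounds inclusive
        else:
            params[name] = rng.uniform(low, high)
    return params


def random_search(objective, space, max_trials, seed=0):
    """Return (best_params, best_value), minimizing objective over max_trials samples."""
    rng = random.Random(seed)
    best_params, best_value = None, float("inf")
    for _ in range(max_trials):
        params = sample_trial(space, rng)
        value = objective(params)
        if value < best_value:
            best_params, best_value = params, value
    return best_params, best_value


def toy_loss(params):
    # Stand-in for a fine-tuning run that reports a training loss.
    return abs(params["learning_rate"] - 3e-5) * 1e5 + abs(params["lora_r"] - 16) / 16


if __name__ == "__main__":
    print(random_search(toy_loss, SEARCH_SPACE, max_trials=10))
```

In a real setup each trial would launch a training job and the objective would be a reported metric; the loop above only shows the sample, evaluate, keep-best structure.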

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

@helenxie-bit
Contributor Author

For the test example, please refer to this example in the proposal.

@google-oss-prow google-oss-prow bot added size/L and removed size/XL labels Jul 22, 2024
Signed-off-by: helenxie-bit <[email protected]>
@andreyvelich
Member

Ref: #2339

Member

@andreyvelich andreyvelich left a comment

Thank you, adding a few initial comments.

Review threads (outdated):
sdk/python/v1beta1/kubeflow/katib/api/katib_client.py (7 threads)
sdk/python/v1beta1/kubeflow/katib/constants/constants.py (1 thread)
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Jul 29, 2024
@helenxie-bit
Contributor Author

@andreyvelich When I pushed my latest changes, only three of the end-to-end tests succeeded; the others failed because of a Katib deployment issue. The main cause seems to be the following error:

error: timed out waiting for the condition on pods/katib-mysql-77b9495867-q6txk
NAME                                 READY   STATUS    RESTARTS      AGE
katib-controller-dbc9cc-bmtf4        1/1     Running   0             2m
katib-db-manager-67b8c998f4-mmljn    1/1     Running   1 (59s ago)   2m
katib-mysql-77b9495867-q6txk         0/1     Pending   0             2m
training-operator-86d756f697-5scdr   1/1     Running   0             2m2s
Containers:
  katib-mysql:
    Image:      mysql:8.0.29
    Port:       3306/TCP
    Host Port:  0/TCP
    Args:
      --datadir
      /var/lib/mysql/datadir
    Liveness:   exec [/bin/bash -c mysqladmin ping -u root -p${MYSQL_ROOT_PASSWORD}] delay=10s timeout=1s period=5s #success=1 #failure=10
    Readiness:  exec [/bin/bash -c mysql -D ${MYSQL_DATABASE} -u root -p${MYSQL_ROOT_PASSWORD} -e 'SELECT 1'] delay=10s timeout=1s period=5s #success=1 #failure=10
    Startup:    exec [/bin/bash -c mysqladmin ping -u root -p${MYSQL_ROOT_PASSWORD}] delay=0s timeout=1s period=15s #success=1 #failure=60
    Environment:
      MYSQL_ROOT_PASSWORD:         <set to the key 'MYSQL_ROOT_PASSWORD' in secret 'katib-mysql-secrets'>  Optional: false
      MYSQL_ALLOW_EMPTY_PASSWORD:  true
      MYSQL_DATABASE:              katib
    Mounts:
      /var/lib/mysql from katib-mysql (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-mgm9j (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  katib-mysql:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  katib-mysql
    ReadOnly:   false
  kube-api-access-mgm9j:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  2m    default-scheduler  0/1 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..

I have run my code locally, and everything works fine. I also tried reverting to the code from before the latest update and pushing it through the CI/CD pipelines, but the same error occurred. I suspect this might be due to resource limitations or a Minikube configuration issue.
Could you please help me look into this problem? I would greatly appreciate it!
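
For triaging output like the above, a small helper can pick out the pods that are not fully ready from a pasted `kubectl get pods` table. This is only a sketch: the function name `not_ready_pods` is hypothetical, and it assumes the default column order (NAME, READY, STATUS first), which is what the table in this comment shows.

```python
# Sample copied from the `kubectl get pods` output in the comment above.
SAMPLE = """\
NAME                                 READY   STATUS    RESTARTS      AGE
katib-controller-dbc9cc-bmtf4        1/1     Running   0             2m
katib-db-manager-67b8c998f4-mmljn    1/1     Running   1 (59s ago)   2m
katib-mysql-77b9495867-q6txk         0/1     Pending   0             2m
training-operator-86d756f697-5scdr   1/1     Running   0             2m2s
"""


def not_ready_pods(table):
    """Return (name, ready, status) for pods that are not fully ready and Running.

    Only the first three whitespace-separated fields are used, so a RESTARTS
    value such as "1 (59s ago)" does not disturb the parsing.
    Note: pods that finished normally (status Completed) are also reported.
    """
    rows = []
    for line in table.strip().splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0] == "NAME":
            continue  # skip the header and malformed lines
        name, ready, status = fields[:3]
        ready_count, _, total = ready.partition("/")
        if ready_count != total or status != "Running":
            rows.append((name, ready, status))
    return rows


if __name__ == "__main__":
    for name, ready, status in not_ready_pods(SAMPLE):
        print(f"{name}: {ready} {status}")
```

Here the only pod flagged is `katib-mysql-77b9495867-q6txk`, which matches the `FailedScheduling` event: the pod stays Pending because its PersistentVolumeClaim `katib-mysql` is unbound, so the claim and the cluster's storage provisioning are the next things to inspect.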

@andreyvelich
Member

@tenzen-y Any thoughts?
@helenxie-bit Can you try updating the Minikube version?
https://github.com/kubeflow/katib/blob/master/.github/workflows/template-setup-e2e-test/action.yaml#L40C13-L40C35

@kubeflow/wg-training-leads Do you remember why we are not using a Kind cluster for the Katib E2Es? Is it because we need a PVC with ReadWriteMany for the PBT Suggestion?
https://github.com/kubeflow/training-operator/blob/master/.github/workflows/integration-tests.yaml#L75-L79

@helenxie-bit
Contributor Author

> @tenzen-y Any thoughts?
> @helenxie-bit Can you try updating the Minikube version?
> https://github.com/kubeflow/katib/blob/master/.github/workflows/template-setup-e2e-test/action.yaml#L40C13-L40C35
>
> @kubeflow/wg-training-leads Do you remember why we are not using a Kind cluster for the Katib E2Es? Is it because we need a PVC with ReadWriteMany for the PBT Suggestion?
> https://github.com/kubeflow/training-operator/blob/master/.github/workflows/integration-tests.yaml#L75-L79

@andreyvelich I have updated the Minikube version, and it is working well now. Thank you very much! Please review the latest changes when you have time 😃

@tenzen-y
Member

> @kubeflow/wg-training-leads Do you remember why we are not using a Kind cluster for the Katib E2Es? Is it because we need a PVC with ReadWriteMany for the PBT Suggestion?
> https://github.com/kubeflow/training-operator/blob/master/.github/workflows/integration-tests.yaml#L75-L79

Yes, the reason is RWX PersistentVolume. KinD does not support RWX PV.
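
To make the RWX requirement concrete, here is a minimal sketch that builds a PersistentVolumeClaim manifest requesting the `ReadWriteMany` access mode. The claim name `pbt-shared-example`, the `kubeflow` namespace, and the `1Gi` size are assumptions made for illustration; they are not taken from the Katib manifests. Whether such a claim binds depends on the cluster's storage provisioner, which is the limitation described in the reply above.

```python
import json

# Access modes defined by the Kubernetes PersistentVolume API.
VALID_ACCESS_MODES = {
    "ReadWriteOnce",
    "ReadOnlyMany",
    "ReadWriteMany",
    "ReadWriteOncePod",
}


def pvc_manifest(name, access_mode, storage="1Gi", namespace="kubeflow"):
    """Build a PersistentVolumeClaim manifest as a plain dict.

    The name, namespace, and storage size are hypothetical example values.
    """
    if access_mode not in VALID_ACCESS_MODES:
        raise ValueError(f"unknown access mode: {access_mode!r}")
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "accessModes": [access_mode],
            "resources": {"requests": {"storage": storage}},
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML, so this output could be saved to a file and applied.
    print(json.dumps(pvc_manifest("pbt-shared-example", "ReadWriteMany"), indent=2))
```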

Member

@andreyvelich andreyvelich left a comment

Thanks for this @helenxie-bit!
I left a few comments.
/assign @kubeflow/wg-training-leads @deepanker13

Review threads (outdated):
sdk/python/v1beta1/kubeflow/katib/api/katib_client.py (5 threads)
sdk/python/v1beta1/setup.py (1 thread)
…rn type for helper functions

Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
@helenxie-bit
Contributor Author

@andreyvelich Thank you for your comments! I have made the updates accordingly. Please review them when you have time.

Member

@andreyvelich andreyvelich left a comment

Thanks for this great contribution @helenxie-bit 🎉
/lgtm
/assign @deepanker13 @johnugeorge @tenzen-y
Please take a look at the final changes

@johnugeorge
Member

/approve

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit e251a07 into kubeflow:master Sep 3, 2024
63 checks passed

9 participants